ABSTRACT: Learning analytics has established itself as an important field in the educational
sector. However, the large-scale collection, processing, and analysis of data has pushed the
field into territory marked by numerous ethical breaches and constraints. Revealing
learners' personal information, attitudes, and activities can lead to individuals being
personally identified. De-identification, however, can keep the process of learning
analytics in progress while reducing the risk of inadvertent disclosure of learners' identities. In
this paper, the authors discuss de-identification methods in the context of the learning
environment and propose a first prototype conceptual approach that describes the combination
of anonymization strategies and learning analytics techniques.
1 INTRODUCTION

Learning analytics is an active area of the research field of online education and Technology Enhanced 
Learning (TEL). It applies analysis techniques to the education data stream in order to achieve several 
objectives. These objectives mainly aim to intervene and predict learners' performance in pursuance of 
enhancing the learning context and its environment. Higher Education (HE) and online course 
institutions are looking at learning analytics with an interest in improving retention and decreasing the 
total dropout rate (Slade & Galpin, 2012). However, ethical issues emerge while applying learning 
analytics in educational data sets (Greller & Drachsler, 2012). At the first International Conference on 
Learning Analytics and Knowledge (LAK '11), held in Banff, Alberta, Canada in 2011, participants agreed
that learning analytics raises issues relevant to ethics and privacy and "it could be construed as 
eavesdropping" (Brown, 2011). The massive data collection and analysis of these educational data sets 
can lead to questions related to ownership, transparency, and privacy of data. These issues are not 
unique to the education sector only, but can be found in the human resource management and health 
sectors (Cooper, 2009). At its key level, learning analytics involves tracking students' steps in learning 
environments, such as videos of MOOCs (Wachtler, Khalil, Taraghi & Ebner, 2016), in the interest of 
identifying who are the students "at risk," or to help students with decisions about their futures. 
Nevertheless, tracking interactions of students could unveil critical issues regarding their privacy and 
their identities (Boyd, 2008). 

Ethical issues for learning analytics fall into different categories. We mainly summarize them as the 
following (Khalil & Ebner, 2015b): 1) transparency of data collection, usage, and involvement of third 
parties; 2) anonymization and de-identification of individuals; 3) ownership of data; 4) data accessibility 
and accuracy of the analyzed results; 5) security of the examined data sets and student records from any 
threat. These criteria point to the widely used CIA security model, which stands for Confidentiality,
Integrity (protection from alteration), and Availability for authorized parties.

The learning analytics community needs to deal carefully with the potential privacy issues while 
analyzing student data. Educational data analysis techniques can reveal personal information, attitudes, 
and activities related to learners (Bienkowski, Feng, & Means, 2012). However, there has been limited 
research, and there are still numerous unanswered questions related to privacy, personal information, 
and other ethical issues in the context of learning analytics (Bienkowski, Feng, & Means, 2012; Greller & 
Drachsler, 2012; Slade & Galpin, 2012; Slade & Prinsloo, 2013). For example, some educators claim that 
educational institutions are using applications that collect sensitive data about students without 
sufficiently respecting data privacy and how the data will eventually be used (Singer, 2014). Thus, data 
degradation (Anciaux et al., 2008), de-identification methods, or deletion of specific data records, may 
be required as a solution to preserve learners' information. In this paper, we will mainly focus our 
discussion on the de-identification process in the learning analytics context and offer a first
prototype conceptual approach that combines learning environment, de-identification techniques, and 
learning analytics. 

The paper is organized as follows: Section 2 covers the de-identification in general and the current laws 
associated with education, as well as the drivers linked with learning analytics. In Section 3, we propose 
the de-identification-learning analytics approach. The last section discusses the limitations of the
de-identification process in learning analytics.

2 BACKGROUND 

2.1 Personal Information and De-Identification 

Personal information is any information that can identify an individual. In fields such as the health
sector, it is named Personal Health Information or PHI, while in other fields, such as the education
sector, it is named Personally Identifiable Information or PII. The National Institute of
Standards and Technology (NIST) defines PII as "any information about an individual maintained by an
agency, including 1) any information that can be used to distinguish or trace an individual's identity, 
such as name, social security number, date and place of birth, mother's maiden name, or biometric 
records; and 2) any other information that is linked or linkable to an individual, such as medical, 
educational, financial, and employment information" (McCallister, Grance, & Scarfone, 2010). The 
personal information of learners can be categorized into details such as name, sex, photograph, date of 
birth, age, address, religion, marital status, e-mail address, insurance number, ethnicity, et cetera, or 
educational details such as qualifications, courses attended, degrees, and study records. A
leak of individuals' personal information can lead to misuse of data, embarrassment, and loss of
reputation. However, organizations may be required to publish details extracted from personal
information. For instance, some educational institutions are required to provide statistics about student 
progress; likewise, health organizations may need to report special cases from their patient records, 
such as communicable diseases. As a result, de-identification helps organizations to protect privacy 
while still informing the public. The de-identification process is used to prevent the disclosure of
individual identities and to keep PII confidential.

In learning analytics, it is common for stakeholders to request additional information about the results 
extracted from educational data sets. Educational data mining and learning analytics mainly aim to 
enhance the learning environment and empower learners and instructors (Greller & Drachsler, 2012). 
Therefore, the analysis of these data may have interesting trends that could lead to further and deeper 
analysis by other institutions or researchers. Requests for more extensive analysis may involve the use 
of student-level data. Accordingly, ethical issues arise, such as privacy disclosure, and the need to
de-identify the data becomes paramount.

Recently, Harvard and MIT released de-identified data from 16 courses offered in 2012-2013
on their well-known edX Massive Open Online Course (MOOC) platform (MIT News, 2014). Harvard
and MIT ensure that the anonymity of the released data complies with the Family Educational
Rights and Privacy Act (FERPA).1 Furthermore, Prinsloo and Slade (2015) suggested different approaches
that inform students in higher education of the implications of learning analytics on their private data. 

2.2 De-Identification Legislation 

De-identification of student records has been regulated in the United States and the European Union. 
The United States adopted FERPA regarding the privacy of student educational records. In the European 
Union, the Data Protection Directive (DPD; 95/46/EC 2 ) regulates the processing of personal data and the 
movement of such information. FERPA §99.31(b) deals with the de-identification of data rule. It clearly 
states that institutions "may release, without consent, education records, or information from education 
records, that has been de-identified through the removal of all Personally Identifiable Information (PII)."
This section of FERPA requires institutions to use reasonable methods to identify the other parties who 
disclose education records. On the other hand, the most explicit citation of de-identification in the 
European DPD is Article 26 on anonymization, in which "principles of data protection shall not apply to 
data rendered anonymous in such a way that the data subject is no longer identifiable." Moreover, 
parties are encouraged to use de-identification techniques to render identification of data subjects 
impossible. It is not obvious what level of de-identification is required to anonymize
education records under European law. However, the Article 29 Data Protection Working Party has an
opinion on the identification of data: "Once a data set is truly anonymized and individuals are no longer 
identifiable, European data protection law no longer applies" (2014, p. 5). 

2.3 Drivers of De-Identification in Learning Analytics 

A study by Petersen (2012) addressed the need to de-identify data used in academic analysis before
making it available to institutions, to businesses, or for operational functions. Petersen (2012) pointed
to the idea of keeping a unique identifier in case a researcher may need to study the behaviour of a
particular individual. Slade and Prinsloo (2013), however, drew attention to the ambiguity of data
mining techniques in monitoring student behaviour in educational settings. The authors linked
de-identification with consent and privacy and stressed the need to guarantee student anonymity in their
education records in order to achieve learning analytics objectives such as interventions based on
student characteristics. An example of the link between consent and de-identification would be a
questionnaire or survey whose respondents are told that it will be used for research only. In that case,
the use of their data is clearly limited to that one study. If the survey includes personal
information, however, then assurances of anonymizing their data should be considered.

1 http://www2.ed.gov/policy/gen/guid/fpco/ferpa/index.html (last access January 2015)

2 http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31995L0046:EN:HTML (last access January 2015)

Ryan Baker (2013) discussed the demands of de-identifying educational data sets in his "Learning, 
Schooling, and Data Analytics" chapter in the Handbook on Innovations in Learning for States, Districts, 
and Schools. De-identification of these data sets means being able to share them among other 
researchers without violating FERPA regulations. Baker stressed that educational policies should include 
rules for anonymizing data in order to prevent identifiable information from being leaked without 
authorization. Furthermore, Drachsler and Greller covered the topic of anonymization in their DELICATE 
approach (Drachsler & Greller, 2016). A "strictly guarded key" should be held so that researchers may 
link their results from learning analytics and educational data mining with individual students in order to 
benefit the students. De-identification techniques have been reviewed as a right of access principle in 
learning analytics deployment (Pardo & Siemens, 2014). In addition, Pardo and Siemens further suggest 
that semantic analysis might be required to detect identifiable records in anonymized data sets. 

3 PROPOSED APPROACH 

In this section, we propose a conceptual de-identification-learning analytics framework as shown in 
Figure 1. The framework begins with learners involved in learning environments. Currently, a large 
number of learning environments support online learning, such as MOOCs, Learning Management
Systems (LMS), Immersive Learning Simulations (ILS), mobile learning, and Personalized Learning 
Environments (PLE). These platforms offer environments with rich, vast amounts of data that can be 
quantitatively/qualitatively analyzed to benefit learners and enhance the learning context. 


ISSN 1929-7750 (online). The Journal of Learning Analytics works under a Creative Commons License, Attribution - NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0) 




(2016). De-identification in learning analytics. Journal of Learning Analytics, 3(1), 129-138. http://dx.doi.org/10.18608/jla.2016.31.8


[Diagram not reproduced; surviving panel label: Learning Environment]

Figure 1: The proposed conceptual de-identification-learning analytics framework

The next step is the de-identification process where techniques to convert personal and private 
information into anonymized data take place. De-identification techniques include such methods as 
anonymization, masking, blurring, and perturbation. The last step includes the de-identified data linked 
with a unique descriptor that may be examined by learning analytics researchers and benefit 
stakeholders, but ultimately must be used only to the advantage of students. 


3.1 De-Identification Techniques 


In our proposed de-identification-learning analytics conceptual framework, there are several techniques 
available to de-identify student data records. Figure 3 lists several methods of de-identification and 
provides examples (based on Article 29 Data Protection Working Party, 2014; Cormode & Srivastava, 
2009; Eurostat, 1996; Petersen, 2012). 


Anonymization 

Data anonymization techniques have recently been keenly researched in different structured data 
records with the goal of guaranteeing the privacy of sensitive information against unintended disclosure 
and a variety of attacks (Cormode & Srivastava, 2009). Ohm (2010) defined reasons behind 
anonymization when organizations want to release the data to the public, sell the information to third 
parties, or share the information within the same organization. The difference between anonymization 
and de-identification, however, is often misunderstood. Anonymization principles are a subset of holistic
de-identification methodologies. Data anonymization is the process of de-identifying data while 
preserving its original format (Raghunathan, 2013). In the educational context, anonymization refers to 
different procedures to de-identify student data in such a way that it cannot be re-identified (the 
opposite of de-identification) unless there is a record code. Anonymization is not reserved only for 
tabular data records, but can also be applied to other types of data — such as visualized data or graphs 
— where institutions intend to present their outcomes without revealing sensitive information. 

On the other hand, in addition to anonymization, de-identification includes masking, randomization, 
blurring, and so on. For instance, replacing "Bernard" with "$$$$$$$" is a method of masking while 
altering "Bernard" to "Wolfgang" would be an example of anonymization. However, masking and 
blurring are not as well known as anonymization. De-identification, pseudonymization,
and anonymization are closely related and often conflated topics under the information-concealing umbrella. To clarify the
differences in simple terms, pseudonymization means cloaking the original data with false information 
with the ability to track it back to its original formation; anonymization, conversely, cannot be reversed 
(Raghunathan, 2013). 
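
The distinction between pseudonymization and anonymization can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that pseudonyms are drawn from a small pool of fictitious names and that anonymized values are random tokens; the function names `pseudonymize` and `anonymize` and the name pool are hypothetical and not part of the proposed framework.

```python
import secrets


def pseudonymize(names, pool):
    """Replace each distinct name with a fictitious one and keep the lookup table.

    The pool must contain at least as many entries as there are distinct names.
    """
    mapping = {}
    available = list(pool)
    for name in names:
        if name not in mapping:
            mapping[name] = available.pop(0)
    return [mapping[n] for n in names], mapping


def anonymize(names):
    """Replace each name with a fresh random token; no lookup table is kept."""
    return [secrets.token_hex(4) for _ in names]


names = ["Bernard", "Hadeel", "Bernard"]

# Pseudonymization: the table allows tracing each alias back to the original.
pseudonyms, table = pseudonymize(names, ["Wolfgang", "Ingrid", "Pavel"])
reverse = {alias: real for real, alias in table.items()}
print(pseudonyms)
print([reverse[p] for p in pseudonyms])

# Anonymization: the tokens are random and nothing is stored to reverse them.
print(anonymize(names))
```

Whether a replacement counts as pseudonymization or anonymization in this sketch depends only on whether the lookup table is retained, which mirrors the reversibility criterion described above.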

As previously mentioned, educational data records may include private information, such as name or
student ID, each of which on its own is called a direct identifier. Removing or hiding these identifiers does not
assure a true data anonymization. Identifiers could be linked with other information that would allow
identification of individuals (see Figure 2). Taking quasi-identifiers into account, however, helps to ensure better
de-identification of data. "Date of Birth + Sex + Name" is an example of a quasi-identifier. In 2006, AOL
released the search records of 500,000 of its users. Several days after AOL's database release, New York 
Times journalists were able to reveal the identity of a 62-year-old widow using a similar process to that 
shown in Figure 2 (Soghoian, 2007). AOL admitted that the data release was a mistake and the research 
team responsible for sharing the data was fired. 
Figure 2: Linking data sources leads to name identification


Another example of identifying individuals was reported in 2000 when demographic information led to 
retrieving the names and contact information of patients whose medical data had been released in the 
United States (Sweeney, 2000). 
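
The linking process described in these examples can be sketched in a few lines of Python. All records below are invented, and the choice of birth date, sex, and postcode as the shared attributes is an assumption made for illustration only; the function name `link` is likewise hypothetical.

```python
# Hypothetical "de-identified" release: names removed, other attributes kept.
released = [
    {"birth_date": "1990-04-12", "sex": "F", "postcode": "8010", "grade": 70},
    {"birth_date": "1988-11-30", "sex": "M", "postcode": "8020", "grade": 85},
]

# Hypothetical public source that still carries names, e.g. a membership list.
public = [
    {"name": "Person A", "birth_date": "1990-04-12", "sex": "F", "postcode": "8010"},
    {"name": "Person B", "birth_date": "1975-01-05", "sex": "M", "postcode": "8020"},
]

QUASI = ("birth_date", "sex", "postcode")


def link(released, public, keys=QUASI):
    """Return (name, grade) pairs for released records that match exactly one person."""
    index = {}
    for person in public:
        index.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    matches = []
    for record in released:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # a unique match amounts to re-identification
            matches.append((names[0], record["grade"]))
    return matches


print(link(released, public))
```

In this toy data set, the first released record matches exactly one named person, so the grade is re-identified even though no name was ever released.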
Samarati and Sweeney (1998) provided a well-known anonymization technique, namely
k-anonymity. This method addresses the problem of linking records to identify the individual's
information when releasing data, thus safeguarding anonymity. The k-anonymity technique ensures that
each released record cannot be distinguished from at least k-1 other records on its quasi-identifiers (Cormode & Srivastava, 2009).
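
As a rough illustration of the k-anonymity idea, the following Python sketch computes k for a small invented data set and shows how generalizing one attribute raises it. The records, the ten-year age bands, and the helper names `k_of` and `generalize_age` are assumptions for this example only.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Return the size of the smallest group sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalize_age(record, width=10):
    """Replace an exact age with an age band of the given width."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}


records = [
    {"age": 21, "sex": "F", "grade": 70},
    {"age": 24, "sex": "F", "grade": 85},
    {"age": 33, "sex": "M", "grade": 60},
    {"age": 38, "sex": "M", "grade": 90},
]

# With exact ages every record is unique, so k is 1.
print(k_of(records, ["age", "sex"]))

# After generalizing ages into bands, each record shares its values with another.
coarse = [generalize_age(r) for r in records]
print(k_of(coarse, ["age", "sex"]))
```

The sketch only measures k; it does not search for the least lossy generalization, which is what practical k-anonymization algorithms aim for.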


De-Identification Techniques

Original records:

Name      Last name  E-mail      Course   Grade
Kathrine  Ebeela     k_e@gmx.at  GOL 1.0  70%
Hadeel    Ismael     h_i@gmx.at  MEK 1.1  85%

Techniques: Hashing, Suppression, Masking, Swapping, Noising

Hashing:

Kathrine  6cbe65c160  GOL 1.0  70%
Hadeel    386f43fab5  MEK 1.1  85%

Explanation: Last name and email are hashed into a special [...]

Noising:

Explanation: Add a fixed percentage value to students' grades

Surviving cells from the remaining example rows: GOL 1.0, Ebeela, MEK 1.1, 80%

Figure 3: Examples of de-identification techniques
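
Four of the techniques named in Figure 3 can be sketched in Python using the two example records from the figure. This is a minimal sketch under stated assumptions: the salted SHA-256 digest truncated to ten characters, the salt value, and the helper names are illustrative choices and not necessarily how the values in the figure were produced.

```python
import hashlib
import random

records = [
    {"name": "Kathrine", "last_name": "Ebeela", "email": "k_e@gmx.at",
     "course": "GOL 1.0", "grade": 70},
    {"name": "Hadeel", "last_name": "Ismael", "email": "h_i@gmx.at",
     "course": "MEK 1.1", "grade": 85},
]


def hash_fields(record, fields, salt, length=10):
    """Hashing: replace the given fields with one truncated, salted digest."""
    joined = "|".join(str(record[f]) for f in fields)
    digest = hashlib.sha256((salt + joined).encode("utf-8")).hexdigest()[:length]
    kept = {k: v for k, v in record.items() if k not in fields}
    return {**kept, "hash": digest}


def suppress(record, fields):
    """Suppression: drop the given fields entirely."""
    return {k: v for k, v in record.items() if k not in fields}


def swap(records, field, rng):
    """Swapping: shuffle one field's values across records."""
    values = [r[field] for r in records]
    rng.shuffle(values)
    return [{**r, field: v} for r, v in zip(records, values)]


def add_fixed_noise(record, field, offset):
    """Noising: add a fixed value to a numeric field."""
    return {**record, field: record[field] + offset}


print(hash_fields(records[0], ["last_name", "email"], salt="example-salt"))
print(suppress(records[0], ["last_name", "email"]))
print(swap(records, "course", random.Random(0)))
print(add_fixed_noise(records[0], "grade", 10))
```

With only two records, swapping may leave the values in their original positions; the effect becomes meaningful on larger data sets.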

Masking 

Masking is a de-identification technique that replaces sensitive data with fictional data in order to 
disclose results outside the institution. Data masking can modify the data records so that they remain 
usable while keeping personal information confidential. For instance, character masking replaces a 
string with special characters. 
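
Character masking of this kind might look as follows in Python. The function name `mask` and the option to keep trailing characters visible are assumptions added for illustration.

```python
def mask(value, mask_char="$", keep_last=0):
    """Replace every character with mask_char, optionally keeping the last few visible."""
    if keep_last <= 0:
        return mask_char * len(value)
    visible = value[-keep_last:]
    return mask_char * (len(value) - len(visible)) + visible


print(mask("Bernard"))                        # $$$$$$$
print(mask("k_e@gmx.at", "*", keep_last=7))   # ***@gmx.at
```

Note that a mask of this form still reveals the length of the original string, which can itself be a hint in small data sets.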


Blurring 

Blurring involves reducing precision to minimize the identification of data. There are several ways to 
achieve blurring, such as dividing the data into subcategories, randomizing the data fields, or adding 
noise to data records. 
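
The blurring strategies mentioned above can be sketched as follows. The ten-point grade bands, the noise range of three points, integer grades on a 0-100 scale, and the helper names are all assumptions chosen for this example.

```python
import random


def to_band(grade, width=10):
    """Reduce precision by reporting a grade band instead of the exact grade."""
    low = (min(grade, 99) // width) * width
    high = 100 if low + width >= 100 else low + width - 1
    return f"{low}-{high}%"


def add_noise(grade, rng, spread=3):
    """Add a small random offset and clamp the result to the 0-100 range."""
    noisy = grade + rng.randint(-spread, spread)
    return max(0, min(100, noisy))


def year_only(iso_date):
    """Blur a date of birth in YYYY-MM-DD form down to the year."""
    return iso_date[:4]


rng = random.Random(42)  # seeded only so the sketch is repeatable
print(to_band(70), to_band(85), to_band(100))
print(add_noise(70, rng))
print(year_only("1990-04-12"))
```

Wider bands or larger noise ranges reduce the risk of identification further, at the cost of less precise analysis results.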


3.2 Coding Data Records 

In scientific research, data usually requires further investigation with researchers looking deeper into 
the details. Having de-identified data might be insufficient for these purposes; researchers may require 
additional information in order to do more analysis. The American federal Health Insurance Portability
and Accountability Act (HIPAA), which is responsible for protecting the confidentiality of patient records, 
authorizes using an "assigned code" that can be appended to the records in order to permit the 
information to be re-identified for research purposes. 3 Based on that HIPAA rule, we found that FERPA 
99.31(b) allows for using a unique descriptor for student data records in order to match an individual's 
information for research and institutional use. Accordingly, we conclude that assigning a code to student 
records in our proposed framework can grant learning analytics researchers the ability to study 
behaviours of specific students and, therefore, can benefit learners. Despite the fact that learning 
analytics poses ethical challenges, the main goal is still to benefit learning environments and students, 
such as making recommendations, classifying students into profiles or predicting their performance 
(Ebner & Schon, 2013; Greller & Drachsler, 2012; Slade & Prinsloo, 2013; Khalil & Ebner, 2015a; Khalil, 
Kastl & Ebner, 2016). 
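
One way to picture the assigned code is the Python sketch below. It assumes that codes are random tokens and that the table linking codes to student IDs is stored separately under restricted access; the function name `assign_codes`, the field names, and the sample records are hypothetical.

```python
import secrets


def assign_codes(records, id_field="student_id"):
    """Replace the identifier in each record with a random code.

    Returns the coded records and a key table (code -> identifier).
    The key table is assumed to be kept apart from the released data.
    """
    key_table = {}
    codes_by_id = {}
    released = []
    for record in records:
        sid = record[id_field]
        if sid not in codes_by_id:
            code = secrets.token_hex(8)
            while code in key_table:  # regenerate on the unlikely event of a collision
                code = secrets.token_hex(8)
            codes_by_id[sid] = code
            key_table[code] = sid
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["code"] = codes_by_id[sid]
        released.append(coded)
    return released, key_table


records = [
    {"student_id": "s001", "course": "GOL 1.0", "grade": 70},
    {"student_id": "s002", "course": "MEK 1.1", "grade": 85},
    {"student_id": "s001", "course": "MEK 1.1", "grade": 75},
]

released, key_table = assign_codes(records)
print(released)  # researchers can follow one student across courses via the code
```

Random codes are used here rather than values computed from the student ID, so that the code cannot be recomputed by someone who knows or guesses the ID; only the holder of the key table can map a code back to a student.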

4 LIMITATIONS 

Despite the fact that de-identification protects confidential information and privacy, the de-identified 
data still poses some privacy risks (Petersen, 2012). In many cases, some attributes are capable of 
identifying individuals; in other cases, attackers can link records together from different sources and 
therefore "code break" the de-identification. On the other hand, in their paper "Privacy, Anonymity, and 
Big Data in the Social Sciences," Daries et al. (2014) asserted that with de-identification, there is no
guarantee of keeping the analysis process uncorrupted. Pardo and Siemens agree that "data can be
either useful or perfectly anonymous, but never both" (2014, p. 447). The bottom line is that the stricter
the de-identification guidelines, the greater the negative effect on the ultimate analysis.

5 CONCLUSION 

Since learning analytics first became known in 2011, it has helped learners to improve their performance 
based on analyzing their educational data. Nevertheless, this field raises many issues related to ethics 
and ownership. The massive scale of data collection and analysis leads to questions about the consent 
and privacy of personal information. This paper mainly discusses one of the attainable solutions for 
preserving learners' sensitive information, the "de-identification of data" to facilitate learning analytics 
applications. We shed light on this topic via US and EU regulations regarding data privacy. We proposed 
a conceptual approach with examples of de-identification techniques that assist us with our "iMooX" 
platform (http://www.imoox.at) and can help learning analytics specialists preserve confidential learner 
information. 

Although de-identification is not a foolproof solution for protecting learner privacy, it is an imperative 
consideration in examining the ethical dimensions of learning analytics. 


3 Rule 45 C.F.R. § 164.514(c). 